There are not clear, correct answers for this lab. This key has some examples of good work from students in the class.
Part one: 50
Part two: 50
Extra credit: best 2012 predictions in part two using MSE
Up to -5 for a missing figure or insufficient discussion of it. For example, you got points off if you simply restated what was in a graph without adding any value to the presentation.
You might get some proportion of this deduction if the grader decides your models are essentially the same or lack sufficient justification for why you chose to make them apparently similar.
You might also get some portion of the deduction for bad practice: not setting up a training/test set or something similar.
-5 for failing to produce 2012 predictions in a separate csv file.
-7.5 each (max) for missing or insufficient discussion of a model. You won’t lose more than 15 points here.
If your predictions file did not include a variable called `cnt_pred’ exactly, as the prompt said to do, it wasn’t considered.
-7 if at least one of your markdown documents failed to knit.
-1 for grammatical or other written communications errors, such as poor or unclear syntax and dangling modifiers.
-2 each for having an illegible plot.
Be careful how you interpret linear model output. I made a number of comments for those of you who didn’t quite interpret p-values, R-squared etc. correctly.
This exercise was about prediction, and the major criterion for model selection in prediction is out-of-sample error by some measure. Here we used MSE. Many of you justified model choices with p-values, R-squared and other linear model characteristics fit on the training sample. That’s often not going to help you optimize predictions and can lead to overfitting. You might want to review the modeling and cross-validation lectures.
Some of you argued dropping variables with largish p-values, from a linear model fit on the training data, was one way to avoid overfitting. That isn’t true.
But when you multiply hour by temperature, you are getting an `interaction’ that is small for hours early in the day and large for hours late in the day, for a given temperature. Yet 2 am and 11pm are both in the middle of the night—and the EDA showed ridership was lower in the middle of the night regardless of temperature. Your model will have difficulty handling modeling those data because you have created very large hour times temperature variables (23 times temperature, say) but also small hour times temperature (2 times temp) for outcomes that are very similar.
A better solution to handle that issue would have been to create a factor variable for time of day. Or if that’s too many variables, you could have broken up the hour variable into a few categories, such as late-night, evening, midday, morning.
Using temperature times humidity as a way to capture their interaction does make sense, however.
The class won’t go too deeply into non-linear transformations to address such issues. Some fixes include fitting a generalized linear model or using a transformation that allows the transformed outcome better to meet the linear model requirements.
Still you should always be thinking critically about what your model is doing. Impossible predictions should be a red flag, and should prompt you to rethink your strategy. We will cover a variety of prediction methods in this class. If one is starting not to make sense, reflect on whether there is a better way.
If you don’t think that kind of thing matters in data science: It does. Errors like those make you less credible, particularly in the eyes of people who don’t understand your code and look for proxies to evaluate your competence.
Redundancy detracts from your presentation. When giving text with graphs, you shouldn’t need to describe what the graph is. That should be clear in the graph using labels, titles and legends. Text should give something more, some insight or take-away from the graph.
The EDA phase of a project is the perfect time to find data quality issues—and often there will be some. You don’t want to find problems when modeling, or worse, after making predictions. Think about what you are doing and check whether it makes sense with other information you have. A number of you pointed out the season variable was suspect. Great work.
When adding color to plots, convert to factors within the ggplot function to get discrete colors schemes for variables that are numeric but discrete in nature. For example, working day is a numeric variable taking values 0,1. But really, it is a categorical variable. If you use it to color a plot, the default in ggplot will be to use a continuous color scale. That makes the legend a little sloppy and more difficult to read. Use aes(… color = factor(workingday)) in ggplot to get a discrete, two-color scheme and legend.
Don’t split hairs when comparing MSE. Figuring out whether a small-ish difference in MSE is meaningful can be tricky. You might have several models with very very similar out-of-sample MSE. One way to compare them is to use content-based knowledge—one model might make more sense for a particular dataset, based on prior knowledge—and another is to look at other measures: other error statistics or measures of variability in your predictions.
BB: This graph is really well-made and clearly communicates an important aspect of the dataset.
Here we have the average number of bike rentals by the hour of day and divided by the day of the week. There is a clear difference between Saturday and Sunday and the rest of the week. On weekdays we see peaks at the beginning and end of the business day. On weekends, people seem more likely to go for a ride midday or in the afternoon.
hour11 <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/bikes_2011.csv')
# Make some integer columns factors
hour11 <- mutate(hour11, workingday = factor(workingday),
holiday = factor(holiday),
weathersit = factor(weathersit),
weekday = factor(weekday),
season = factor(season))
group_by(hour11, hr, weekday) %>%
summarize(cnt = median(cnt)) %>%
ggplot(aes(x = hr, y = cnt)) + geom_line(aes(color = weekday), size = 1) + labs(x = "Hour", y = "Average Number of Rentals") +
scale_color_discrete(name = "Day of the Week", labels = c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"))
BB: A number of you contemplated the difference between registered-user rentals and casual-user rentals. This graph does a good job showing the two counts seem to be strongly correlated, particularly when conditioning on whether it is a working day or not. That brings up a problem for modeling: Do you model them separately, but lose the information contained in their relationship? Do you model them together (using total counts) but miss the chance to have a more accurate model? Some of you did the former, some the latter.
This kind of scenario calls for trying hierarchical modeling, in which your model parameters are allowed to vary by subgroup and model information is shared across those groups—likely beyond the scope of this course. Still, you should be able to recognize that something unusual and important is going on when you see this graph.
If you’re interested: A good book on hierarchical models is `Data analysis using regression and multilevel/hierarchical models’, by Jennifer Hill and Andrew Gelman.
ggplot(data = hour11, mapping = aes(x = registered, y = casual)) + geom_point(aes(color = as.factor(workingday))) +
geom_smooth(method = "lm") + scale_color_discrete(name = "", labels = c("Non-Working Day", "Working Day"))
Although there appears ot be a positive relationship between the number of registered and number of casual users on a given day, the relationship doesn’t appear to be strictly linear. There appear to be two distinct trends, one for working days and one for non-working days. Registered users tend to rent bikes more on working days, while casual users tend to rent bikes more on non-working days. This matches what previous plots have exhibited.
BB: What’s good about this plot is it shows there likely is a difference in the effect of temperature based on the hours of the day. You should remember this plot when you get to the modeling phase, and test out some form of interaction between hour and temperature. Temperature doesn’t have an effect on ridership if nobody is riding anyway.
# data from 2011
hour11 <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/bikes_2011.csv')
hour11 %>%
ggplot() +
geom_jitter(aes(x=temp, y=cnt), width=.2, height=0, size=.5) +
ggtitle('Temperature on Rental Count by Hour') +
labs(x = 'Temp', y = 'Rental Count')+
facet_wrap(~hr, nrow=6)
BB: This is interesting as a good attempt to understand what a number of you wondered about: Is the season variable mis-coded, where winter should be marked 1? Are these quarters of the years not seasons? The seasons each have four months represented, so it’s possible the season variable is based on the actual dates of when the seasons start and end. That was something you could have checked with the dteday variable. Only one person did.
Others among you did plots like the second one.
# data from 2011
hour11 <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/bikes_2011.csv')
filter (hour11, holiday == 1) %>%
ggplot(aes(x=season)) +
geom_histogram() + ggtitle('Figure 3') + labs(x='Season')
#ggplot(hour11, aes(x=season)) + geom_histogram()
Figure 3 is an attempt at trying to understand the data and possibly see if their is a mistake with the key for the numbering of the seasons. To obtain this plot, I used dplyr’s filter function to filter out the data points that have 1 as the value for the holiday column, meaning that it was a holiday that day. This totaled to 239 observations from the full set. From there, I created this histogram, with seasons as the x - axis value. So, it is showing us the number of observations that fall within each season, when it is a holiday. From the plot, we can see that seasons 1 and 4 are pretty much equal and seasons 2 and 3 are equal. So, there are more holidays in season 1 and 4. We can take a look at the list of federal holidays in the US, such as from the following website:http://dchr.dc.gov/page/holiday-schedules, given in the website with the archives and dataset information. From this list, there are four holidays in the winter (Dec - March), one in spring (March - June), two in summer (June - September), and three in fall (Sept - Dec). So, according to that, the most should be in season 4 (winter), and the least in season 1 (spring). This graph shows that holidays fall the most in the spring and winter. So, from this graph and from our doubt of incorrect key in Figure 2, it is possible that season 1 is not supposed to indicate spring. But, we cannot say that for sure, as the dataset could be considering different months for the seasons.
hour11 %>%
ggplot(aes(x=hr, y=cnt)) +
geom_jitter(aes(color = factor(season)), width=.2, height=0, size=.3) +
ggtitle('Figure 4') +
labs(x = 'Hour', y = 'Rental Count', color = 'Seasons')+ facet_wrap(~season)
Figure 4 looks at hour versus rental count, faceted by the four seasons. It shows that seasons 2, 3, and 4 have similar levels of count. Season 1 is much lower than the other three. The highest count value for season 1 is about 400, which is about 200 less than that of the highest values of the other three seasons. Again, if we are going with the key given, and season 1 is spring, the reason that there is such a low is possibly because of allergies and the pollen. Maybe if people are going to work, they don’t want to get on bikes that have been out and are covered in yellow pollen, before they get to work. In seasons 2, 3, and 4, there is a small arch/curve around hours 15 - 20, with a majority of the counts above 150/200. This could relate to the time of rush hour and people getting back home from work.
BB: In addition to making a good point about the season variable, this one is a good example of when jitter is helpful. The x coordinates are discrete, so adding jitter a bit left or right doesn’t change our interpretation of the (x,y) value but does make the plot more legible.
hour11 %>%
ggplot() + geom_jitter(aes(x=season, y = temp, color = season), width = .02, size = .005) + ggtitle("Season vs Tempature")
Figure 2 shows the seasons per temperature for 2011. This plot is important because it shows a discrepancy in the data. For example, we see that season 3 - which is supposed to be Fall - has the highest temperatures, and season 1 - Spring - has the lowest. After looking closer at the data, we see that the seasons are actually split up according to “every three months” and not the actual calendar dates for the seasons. Therefore, season is an unreliable variable.
BB: Many of you did some looking into whether humidity and temperature together affect ridership. This plot does a great job putting a finger on the fact that there are pretty clear humidity-temperature regions of ridership levels.
A note I made to several people: Jittering isn’t typically a good idea when the values of both your x and your y coordinates are on a continuous scale. Doing so could change the meaning of the (x,y) coordinate. Examples: If your x-axis is categorical, then jittering a little to the left or to the right doesn’t change which category you view that point to be in. But if your x-axis is continuous, then adding +.5 to the x coordinate muddies the picture, since (x+.5, y) does not have the same meaning as (x,y).
It’s a toss-up here as to whether to use it or not, since there seems to be a fair number of points plotted at the same (x, y) coordinates.
hour11 %>%
ggplot(aes(temp, hum)) +
geom_jitter(aes(color = cnt)) +
scale_color_gradientn(colors = rainbow(5)) +
ggtitle('Rental count as related to Temperature and Humidity')
This jitter plot in figure 4 allows us to visualize three variables at once. it shows where rental count falls on a grid of temperature and humidity. It shows us that there are the fewest number of rentals when temperarure is low and humidity is high. Rentals peak around .75 normalized temperature and .5 normalized humidity. Since these highest rental counts are centered around those climates, with rentals steadily decreasing as climate deviates you could expect the most rentals with medium humidity and moderately-high temperature.
# BB: Without jitter
hour11 %>%
ggplot(aes(temp, hum)) +
geom_point(aes(color = cnt)) +
scale_color_gradientn(colors = rainbow(5)) +
ggtitle('Rental count as related to Temperature and Humidity')
BB: A number of you found work-arounds to display registered and casual rentals together on a graph. This code does so by creating a new data frame with categorical variable for casual/registered, then using conditional fill colors in ggplot. Nice work. The code below uses the reshape2 package, tidyr’s big sister. You probably could have done so with tidyr too, but I haven’t tried.
library(reshape2)
weekday_diff <- hour11[, c("weekday", "casual", "registered")]
weekday_diff <- melt(weekday_diff, id = c("weekday"))
ggplot(weekday_diff, aes(x = weekday, y = value, fill=variable)) +
geom_bar(stat="identity", position = "dodge") +
ggtitle('Figure 3') +
labs(x = 'weekday', y = 'casual & registered')
something about issue with outcome being non-negative and modeled by LM with likelihood on entire real line. try running a glm poisson model and compare it to the winning model.
BB: This is pulled directly from class notes, but it seems to work fairly well and is fast to code. Watch out for overfitting when doing large-degree polynomials. Out-of-sample validation is key.
# data from 2011
hour <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/bikes_2011.csv')
# x data from 2012
hour12_x <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/bikes_12_x.csv')
hour <- hour %>%
mutate(workingday=factor(workingday),
holiday=factor(holiday),
weathersit=factor(weathersit),
weekday=factor(weekday))
n <- dim(hour)[1]
n_tr <- floor(n*.8)
tr_indices <- sample(x=1:n, size=n_tr, replace=FALSE)
train <- hour[tr_indices, ]
test <- hour[-tr_indices, ]
d_max <- 25
models <- list()
error <- tibble(degree=1:d_max,
MSE_tr = rep(0, d_max))
for(d in 1:d_max){
models[[d]] <- lm(cnt ~ poly(temp,d), train)
mse_tr <- mean(models[[d]]$residuals^2)
error[d, 'MSE_tr'] <- mse_tr
}
ggplot(error)+
geom_point(aes(x=degree, y=MSE_tr)) +
geom_line(aes(x=degree, y=MSE_tr))
error <- error %>%
add_column(MSE_tst=rep(0, d_max))
for(d in 1:d_max){
model <- models[[d]]
test_results <- test %>%
mutate(cnt_pred = predict(model, newdata=test)) %>%
mutate(resid_sq = (cnt-cnt_pred)^2)
mst_tst <- summarise(test_results, mse_tst = mean(resid_sq))[[1]]
error[d, 'MSE_tst'] <- mst_tst
}
mst_tst <- summarise(test_results, mse_tst = mean(resid_sq))[[1]]
error %>%
rename(tr=MSE_tr, tst=MSE_tst) %>%
gather(key=type, value=error, tr, tst) %>%
ggplot() +
geom_point(aes(x=degree, y=log10(error), color=type)) +
geom_line(aes(x=degree, y=log10(error), color=type))
BB: This response did not use much in terms of formal variable selection procedures. That’s fine. It did do a good job making sensical choices based on the EDA and—most importantly—using MSE from the test set to justify which models to drop and which to keep.
Before creating my models, I read in the data and broke the 2011 data into train and test sets. I also changed categorical variables to factors so that I could use them in my models.
library(tidyverse)
hour11 <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/bikes_2011.csv')
hour12_x <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/bikes_12_x.csv')
hour11 <- mutate(hour11, workingday = factor(workingday), weathersit = factor(weathersit), mnth = factor(mnth), season = factor(season))
hour12_x <- mutate(hour12_x, workingday = factor(workingday), weathersit = factor(weathersit), mnth = factor(mnth), season = factor(season))
#split into train and test sets
tr_indices <- sample(x=1:nrow(hour11), size=floor(nrow(hour11)*0.8), replace=FALSE)
train <- hour11[tr_indices,]
test <- hour11[-tr_indices,]
First Model
From the exploratory data analysis in part 1 of the assignment, I know that the number of bike rentals varies with different values for weather, time of day, date, and type of day. One option would be to fit a linear model using all 13 possible variables available in the hour12_x dataset. However, this is probably too many variables and will result in overfitting. Additionally, many of the variables appear to represent the same underlying factor influencing number of bike rentals. For example, temp and atemp are very similar, and holiday and weekday probably combine to represent nearly the same thing as workingday. The change in number of rentals by season and month most likely are due almost entirely to changes in temperature.
For these reasons, I chose to base my first model on the hr, workingday, and atemp variables. For the first model, I chose to do a simple linear model with these three variables.
first_model <- lm(data = train, formula = cnt ~ workingday + hr + atemp)
To check my intuition about excluding other explanatory variables, I tried adding hum, windspeed, holiday, weekday, mnth, and season (one at a time) to the model. In each case, the change to the test error was negligible, so I left them out of my model.
Second Model
For the second model, I used the same three explanatory variables and added weathersit, but I made some polynomial terms and some interactions. I considered including weathersit in my first model, but since there was only one instance of level 4 in the 2011 data, the prediction with the test data failed whenever that observation was in the test and not training set.
Because my EDA plot of temperature vs. count appeared to have a quadratic form, I changed atemp to be a quadratic variable, rather than a linear one. Because it also appeared that bike rentals followed two very different patterns on working and non-working days, I chose to make workingday an interaction with hr rather than two separate terms.
The final change involved changing hr to a polynomial term and choosing what degree polynomial to make it. In the predictive modeling lecture notes, we saw that test error was strictly decreasing as the degree of the hr polynomial term increased. However, my model may perform differently when I add the workingday interaction. I chose a 7-degree polynomial for two reasons: the decrease in test error slowed after 7 degrees in the lecture notes, and the shape of the hourly trend on working days appears to follow the general shape of a 7-degree polynomial.
Also, although the test error was strictly decreasing in the lecture notes, a 20+ degree polynomial still seems like it would be prone to overfitting. With these decisions made, I fit the second model:
second_model <- lm(data = train, formula = cnt ~ poly(hr,7) * workingday + weathersit + poly(atemp,2))
Comparing the Models
After fitting both models, I need to compare them to determine which is better. To do so, I calculated the MSE from predictions for the test set that I set aside at the beginning.
#predict for each model
test_results_m1 <- test %>% mutate(cnt_pred = predict(first_model, newdata=test)) %>% mutate(resid_sq = (cnt-cnt_pred)^2)
test_results_m2 <- test %>% mutate(cnt_pred = predict(second_model, newdata=test)) %>% mutate(resid_sq = (cnt-cnt_pred)^2)
#calculate and compare MSE
mse_test1 <- summarise(test_results_m1, mse_tst = mean(resid_sq))[[1]]
mse_test2 <- summarise(test_results_m2, mse_tst = mean(resid_sq))[[1]]
mse_test1
## 1
## 12153.81
mse_test2
## 1
## 5847.841
From this, we see that the test MSE is much lower for the second model than the first. This makes sense because the first model was purely linear, while most of the relationships between explanatory variables and the number of bike rentals appeared in my EDA plots to be non-linear. I will therefore choose the second model to make my predictions for the 2012 data.
BB: One of you use regression trees to model the count data. Neat. To learn more about regression trees, see chapter 9 of The elements of statistical learning, by Hastie, Tibshirani and Friedman.
This person also scaled and centered some of the predictors—which sometimes is a good idea, sometimes not.
library(rpart)
hour11 <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/bikes_2011.csv')
hour12_x <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/bikes_12_x.csv')
# Data preprocessing
hour11_cnt <- hour11$cnt
hour11 <- as.data.frame(apply(hour11[, -c(1:2, 4, 15:16)], 2, scale))
hour11$cnt <- hour11_cnt
# Spliting the 2011 data into a train and test set
inTrain <- sample(c(1:nrow(hour11)), floor(8645 * 0.8))
training <- hour11[inTrain, ]
testing <- hour11[-inTrain, ]
testing_cnt <- testing$cnt
testing$cnt <- NULL
model2 <- rpart(cnt ~ ., data = training[, -c(1:2, 4, 15:16)])
printcp(model2)
##
## Regression tree:
## rpart(formula = cnt ~ ., data = training[, -c(1:2, 4, 15:16)])
##
## Variables actually used in tree construction:
## [1] atemp hr temp workingday
##
## Root node error: 124334410/6916 = 17978
##
## n= 6916
##
## CP nsplit rel error xerror xstd
## 1 0.311992 0 1.00000 1.00033 0.0199844
## 2 0.143494 1 0.68801 0.68829 0.0155605
## 3 0.056859 2 0.54451 0.54610 0.0125818
## 4 0.030887 4 0.43080 0.43305 0.0100001
## 5 0.018361 7 0.33814 0.34087 0.0089413
## 6 0.015115 9 0.30141 0.30472 0.0082102
## 7 0.014780 10 0.28630 0.29001 0.0079972
## 8 0.011376 11 0.27152 0.27334 0.0074686
## 9 0.010000 12 0.26014 0.26418 0.0071756
test_pred2 <- predict(model2, testing[, -c(1:2, 4, 15:16)])
test_pred2 <- floor(test_pred2)
# put the actual and predicted counts in a data frame
results2 <- tibble(cnt_actual = testing_cnt,
cnt_pred = test_pred2)
results2 %>%
mutate(resid = cnt_actual - cnt_pred) %>%
mutate(resid_sq = resid^2) %>%
summarise(MSE = mean(resid_sq))
## # A tibble: 1 × 1
## MSE
## <dbl>
## 1 5386.208
BB: This model does a great job addressing what many of you pointed out—that temperature and other factors likely will have different effects on ridership based on time of day. It also used some non-linear terms for the hour variable, as many of you did.
MSE for this model: 19,433.
MSE statistics for 2012 predictions, among the valid ones:
Notice the MSEs here are substantially larger than what you found, in general, even for the out-of-sample error. This is why doing cross-validation is so important.
Setup
# data from 2011
hour11 <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/bikes_2011.csv')
#Compute rush hour indication, daytime indication
hour11 <- hour11 %>%
mutate(rush_7 = workingday & !holiday & hr == 7) %>%
mutate(rush_8 = workingday & !holiday & hr == 8) %>%
mutate(rush_5 = workingday & !holiday & hr == 17) %>%
mutate(rush_6 = workingday & !holiday & hr == 18) %>%
mutate(rush_7pm = workingday & !holiday & hr == 19) %>%
mutate(norm_hr = hr > 5) %>%
mutate(daytime = (hr > 5 && hr < 22)) %>%
mutate(hr_centered = (hr + 19) %% 24)
set.seed(100)
n <- dim(hour11)[1]
n_tr <- floor(n * .8)
tr_indices <- sample(x=1:n, size=n_tr, replace=FALSE)
hour11_train <- hour11[tr_indices, ]
hour11_test <- hour11[-tr_indices, ]
# x data from 2012
hour12_x <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/bikes_12_x.csv')
Explanation
For my regression, I decided to model the general pattern of the day with some form of inverted u shape, either a sinusoidal or quadratic function. I offset this function from the peak time of ridership in the training data, typically around 5/6 for registered users and 3 for casual users. Then, I added an interaction with workingday as there are two distinct sinusoidal patterns in the data. On weekends and holidays, users tend to rent more frequently regardless of registered state. I then scaled the result of this general pattern according to the temperature and weather conditons.
I also created indicator variables for key points in time, mostly rush hour time and daytime. I also included interactions between the indicator variables and workingday as they should not be contributing when there is no rush hour (weekend or holiday). I also included a holiday indicator variable as I hypothesized that more users were out on holidays than normal weekends. Lastly, I included an indicator variable for winter in case it was different due to weather conditions not noted or just a general preference to be in a car regardless of temperature.
Additionally, because of the different nature of registered and casual users, I modeled them separately and then added them together. After training the models, the quadratic fit performed just slightly better than the sinusoidal models, so I used them for generating my predictions.
Modeling BB: code related models other than the selected one omitted.
lin_reg <- lm(registered ~ weathersit * atemp * (
I((12 - abs(hr_centered - 12))^2) * workingday +
I((12 - abs(hr - 12))^2) * I(!workingday) +
rush_5 * workingday +
rush_6 * workingday +
rush_7pm * workingday +
rush_7 * workingday +
rush_8 * workingday +
daytime + I(season == 1) * norm_hr), data=hour11_train)
pred_11 <- predict(lin_reg, newdata = hour11_test)
hour11_test <- hour11_test %>%
mutate(reg_pred=pred_11) %>%
mutate(resids = (registered - reg_pred)^2)
ggplot() +
geom_jitter(data = hour11_test, mapping = aes(x = hr, y = registered, color = 'reg')) +
geom_point(data = hour11_test, mapping = aes(x = hr, y = reg_pred, color = 'pred')) +
ggtitle('Predicted registered riders (quadratic)')
print("Mean squared residuals:")
## [1] "Mean squared residuals:"
mean(hour11_test$resids)
## [1] 1887.519
print("R squared of model on train data:")
## [1] "R squared of model on train data:"
summary(lin_reg)$r.squared
## [1] 0.8365196
Predictions
hour12_x <- hour12_x %>%
mutate(rush_7 = workingday & !holiday & hr == 7) %>%
mutate(rush_8 = workingday & !holiday & hr == 8) %>%
mutate(rush_5 = workingday & !holiday & hr == 17) %>%
mutate(rush_6 = workingday & !holiday & hr == 18) %>%
mutate(rush_7pm = workingday & !holiday & hr == 19) %>%
mutate(norm_hr = hr > 5) %>%
mutate(daytime = (hr > 5 && hr < 22)) %>%
mutate(hr_centered = (hr + 19) %% 24)
lin_reg_cas <- lm(casual ~ (
I((12 - abs(hr_centered - 12))^2) * workingday +
I((12 - abs(hr - 12))^2) * I(!workingday) +
norm_hr * I(season == 1) + daytime + holiday)
* weathersit * atemp, data=hour11_train)
pred_cas <- predict(lin_reg_cas, newdata = hour12_x)
hour12_x <- hour12_x %>%
mutate(cas_pred=pred_cas)
lin_reg_reg <- lm(registered ~ weathersit * atemp * (
I((12 - abs(hr_centered - 12))^2) * workingday +
I((12 - abs(hr - 12))^2) * I(!workingday) +
rush_5 * workingday +
rush_6 * workingday +
rush_7pm * workingday +
rush_7 * workingday +
rush_8 * workingday +
daytime + I(season == 1) * norm_hr), data=hour11_train)
pred_reg <- predict(lin_reg_reg, newdata = hour12_x)
hour12_x <- hour12_x %>%
mutate(reg_pred=pred_reg)
hour12_x <- hour12_x %>%
mutate(cnt_pred = pred_reg + pred_cas)